PISA is a survey that was conducted for the fifth time in 2012. It assesses the competencies of 15-year-old students in reading, mathematics and science (with a focus on mathematics in 2012) across 65 countries and economies. This dataset contains one record per student.
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import re
%matplotlib inline
#Load dataset
df = pd.read_csv('pisa2012.csv', low_memory = False)
df.head()
| | Unnamed: 0 | CNT | SUBNATIO | STRATUM | OECD | NC | SCHOOLID | STIDSTD | ST01Q01 | ST02Q01 | ... | W_FSTR75 | W_FSTR76 | W_FSTR77 | W_FSTR78 | W_FSTR79 | W_FSTR80 | WVARSTRR | VAR_UNIT | SENWGT_STU | VER_STU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Albania | 80000 | ALB0006 | Non-OECD | Albania | 1 | 1 | 10 | 1.0 | ... | 13.7954 | 13.9235 | 13.1249 | 13.1249 | 4.3389 | 13.0829 | 19 | 1 | 0.2098 | 22NOV13 |
| 1 | 2 | Albania | 80000 | ALB0006 | Non-OECD | Albania | 1 | 2 | 10 | 1.0 | ... | 13.7954 | 13.9235 | 13.1249 | 13.1249 | 4.3389 | 13.0829 | 19 | 1 | 0.2098 | 22NOV13 |
| 2 | 3 | Albania | 80000 | ALB0006 | Non-OECD | Albania | 1 | 3 | 9 | 1.0 | ... | 12.7307 | 12.7307 | 12.7307 | 12.7307 | 4.2436 | 12.7307 | 19 | 1 | 0.1999 | 22NOV13 |
| 3 | 4 | Albania | 80000 | ALB0006 | Non-OECD | Albania | 1 | 4 | 9 | 1.0 | ... | 12.7307 | 12.7307 | 12.7307 | 12.7307 | 4.2436 | 12.7307 | 19 | 1 | 0.1999 | 22NOV13 |
| 4 | 5 | Albania | 80000 | ALB0006 | Non-OECD | Albania | 1 | 5 | 9 | 1.0 | ... | 12.7307 | 12.7307 | 12.7307 | 12.7307 | 4.2436 | 12.7307 | 19 | 1 | 0.1999 | 22NOV13 |
5 rows × 636 columns
print(df.info(verbose=True))
<class 'pandas.core.frame.DataFrame'> RangeIndex: 485490 entries, 0 to 485489 Data columns (total 636 columns): # Column Dtype --- ------ ----- 0 Unnamed: 0 int64 1 CNT object 2 SUBNATIO int64 3 STRATUM object 4 OECD object 5 NC object 6 SCHOOLID int64 7 STIDSTD int64 8 ST01Q01 int64 9 ST02Q01 float64 10 ST03Q01 int64 11 ST03Q02 int64 12 ST04Q01 object 13 ST05Q01 object 14 ST06Q01 float64 15 ST07Q01 object 16 ST07Q02 object 17 ST07Q03 object 18 ST08Q01 object 19 ST09Q01 object 20 ST115Q01 float64 21 ST11Q01 object 22 ST11Q02 object 23 ST11Q03 object 24 ST11Q04 object 25 ST11Q05 object 26 ST11Q06 object 27 ST13Q01 object 28 ST14Q01 object 29 ST14Q02 object 30 ST14Q03 object 31 ST14Q04 object 32 ST15Q01 object 33 ST17Q01 object 34 ST18Q01 object 35 ST18Q02 object 36 ST18Q03 object 37 ST18Q04 object 38 ST19Q01 object 39 ST20Q01 object 40 ST20Q02 object 41 ST20Q03 object 42 ST21Q01 float64 43 ST25Q01 object 44 ST26Q01 object 45 ST26Q02 object 46 ST26Q03 object 47 ST26Q04 object 48 ST26Q05 object 49 ST26Q06 object 50 ST26Q07 object 51 ST26Q08 object 52 ST26Q09 object 53 ST26Q10 object 54 ST26Q11 object 55 ST26Q12 object 56 ST26Q13 object 57 ST26Q14 object 58 ST26Q15 int64 59 ST26Q16 int64 60 ST26Q17 int64 61 ST27Q01 object 62 ST27Q02 object 63 ST27Q03 object 64 ST27Q04 object 65 ST27Q05 object 66 ST28Q01 object 67 ST29Q01 object 68 ST29Q02 object 69 ST29Q03 object 70 ST29Q04 object 71 ST29Q05 object 72 ST29Q06 object 73 ST29Q07 object 74 ST29Q08 object 75 ST35Q01 object 76 ST35Q02 object 77 ST35Q03 object 78 ST35Q04 object 79 ST35Q05 object 80 ST35Q06 object 81 ST37Q01 object 82 ST37Q02 object 83 ST37Q03 object 84 ST37Q04 object 85 ST37Q05 object 86 ST37Q06 object 87 ST37Q07 object 88 ST37Q08 object 89 ST42Q01 object 90 ST42Q02 object 91 ST42Q03 object 92 ST42Q04 object 93 ST42Q05 object 94 ST42Q06 object 95 ST42Q07 object 96 ST42Q08 object 97 ST42Q09 object 98 ST42Q10 object 99 ST43Q01 object 100 ST43Q02 object 101 ST43Q03 object 102 ST43Q04 object 103 ST43Q05 object 
104 ST43Q06 object 105 ST44Q01 object 106 ST44Q03 object 107 ST44Q04 object 108 ST44Q05 object 109 ST44Q07 object 110 ST44Q08 object 111 ST46Q01 object 112 ST46Q02 object 113 ST46Q03 object 114 ST46Q04 object 115 ST46Q05 object 116 ST46Q06 object 117 ST46Q07 object 118 ST46Q08 object 119 ST46Q09 object 120 ST48Q01 object 121 ST48Q02 object 122 ST48Q03 object 123 ST48Q04 object 124 ST48Q05 object 125 ST49Q01 object 126 ST49Q02 object 127 ST49Q03 object 128 ST49Q04 object 129 ST49Q05 object 130 ST49Q06 object 131 ST49Q07 object 132 ST49Q09 object 133 ST53Q01 object 134 ST53Q02 object 135 ST53Q03 object 136 ST53Q04 object 137 ST55Q01 object 138 ST55Q02 object 139 ST55Q03 object 140 ST55Q04 object 141 ST57Q01 float64 142 ST57Q02 float64 143 ST57Q03 float64 144 ST57Q04 float64 145 ST57Q05 float64 146 ST57Q06 float64 147 ST61Q01 object 148 ST61Q02 object 149 ST61Q03 object 150 ST61Q04 object 151 ST61Q05 object 152 ST61Q06 object 153 ST61Q07 object 154 ST61Q08 object 155 ST61Q09 object 156 ST62Q01 object 157 ST62Q02 object 158 ST62Q03 object 159 ST62Q04 object 160 ST62Q06 object 161 ST62Q07 object 162 ST62Q08 object 163 ST62Q09 object 164 ST62Q10 object 165 ST62Q11 object 166 ST62Q12 object 167 ST62Q13 object 168 ST62Q15 object 169 ST62Q16 object 170 ST62Q17 object 171 ST62Q19 object 172 ST69Q01 float64 173 ST69Q02 float64 174 ST69Q03 float64 175 ST70Q01 float64 176 ST70Q02 float64 177 ST70Q03 float64 178 ST71Q01 float64 179 ST72Q01 float64 180 ST73Q01 object 181 ST73Q02 object 182 ST74Q01 object 183 ST74Q02 object 184 ST75Q01 object 185 ST75Q02 object 186 ST76Q01 object 187 ST76Q02 object 188 ST77Q01 object 189 ST77Q02 object 190 ST77Q04 object 191 ST77Q05 object 192 ST77Q06 object 193 ST79Q01 object 194 ST79Q02 object 195 ST79Q03 object 196 ST79Q04 object 197 ST79Q05 object 198 ST79Q06 object 199 ST79Q07 object 200 ST79Q08 object 201 ST79Q10 object 202 ST79Q11 object 203 ST79Q12 object 204 ST79Q15 object 205 ST79Q17 object 206 ST80Q01 object 207 ST80Q04 object 208 
ST80Q05 object 209 ST80Q06 object 210 ST80Q07 object 211 ST80Q08 object 212 ST80Q09 object 213 ST80Q10 object 214 ST80Q11 object 215 ST81Q01 object 216 ST81Q02 object 217 ST81Q03 object 218 ST81Q04 object 219 ST81Q05 object 220 ST82Q01 object 221 ST82Q02 object 222 ST82Q03 object 223 ST83Q01 object 224 ST83Q02 object 225 ST83Q03 object 226 ST83Q04 object 227 ST84Q01 object 228 ST84Q02 object 229 ST84Q03 object 230 ST85Q01 object 231 ST85Q02 object 232 ST85Q03 object 233 ST85Q04 object 234 ST86Q01 object 235 ST86Q02 object 236 ST86Q03 object 237 ST86Q04 object 238 ST86Q05 object 239 ST87Q01 object 240 ST87Q02 object 241 ST87Q03 object 242 ST87Q04 object 243 ST87Q05 object 244 ST87Q06 object 245 ST87Q07 object 246 ST87Q08 object 247 ST87Q09 object 248 ST88Q01 object 249 ST88Q02 object 250 ST88Q03 object 251 ST88Q04 object 252 ST89Q02 object 253 ST89Q03 object 254 ST89Q04 object 255 ST89Q05 object 256 ST91Q01 object 257 ST91Q02 object 258 ST91Q03 object 259 ST91Q04 object 260 ST91Q05 object 261 ST91Q06 object 262 ST93Q01 object 263 ST93Q03 object 264 ST93Q04 object 265 ST93Q06 object 266 ST93Q07 object 267 ST94Q05 object 268 ST94Q06 object 269 ST94Q09 object 270 ST94Q10 object 271 ST94Q14 object 272 ST96Q01 object 273 ST96Q02 object 274 ST96Q03 object 275 ST96Q05 object 276 ST101Q01 float64 277 ST101Q02 float64 278 ST101Q03 float64 279 ST101Q05 float64 280 ST104Q01 float64 281 ST104Q04 float64 282 ST104Q05 float64 283 ST104Q06 float64 284 IC01Q01 object 285 IC01Q02 object 286 IC01Q03 object 287 IC01Q04 object 288 IC01Q05 object 289 IC01Q06 object 290 IC01Q07 object 291 IC01Q08 object 292 IC01Q09 object 293 IC01Q10 object 294 IC01Q11 object 295 IC02Q01 object 296 IC02Q02 object 297 IC02Q03 object 298 IC02Q04 object 299 IC02Q05 object 300 IC02Q06 object 301 IC02Q07 object 302 IC03Q01 object 303 IC04Q01 object 304 IC05Q01 int64 305 IC06Q01 int64 306 IC07Q01 int64 307 IC08Q01 object 308 IC08Q02 object 309 IC08Q03 object 310 IC08Q04 object 311 IC08Q05 object 312 IC08Q06 
object 313 IC08Q07 object 314 IC08Q08 object 315 IC08Q09 object 316 IC08Q11 object 317 IC09Q01 object 318 IC09Q02 object 319 IC09Q03 object 320 IC09Q04 object 321 IC09Q05 object 322 IC09Q06 object 323 IC09Q07 object 324 IC10Q01 object 325 IC10Q02 object 326 IC10Q03 object 327 IC10Q04 object 328 IC10Q05 object 329 IC10Q06 object 330 IC10Q07 object 331 IC10Q08 object 332 IC10Q09 object 333 IC11Q01 object 334 IC11Q02 object 335 IC11Q03 object 336 IC11Q04 object 337 IC11Q05 object 338 IC11Q06 object 339 IC11Q07 object 340 IC22Q01 object 341 IC22Q02 object 342 IC22Q04 object 343 IC22Q06 object 344 IC22Q07 object 345 IC22Q08 object 346 EC01Q01 object 347 EC02Q01 object 348 EC03Q01 object 349 EC03Q02 object 350 EC03Q03 object 351 EC03Q04 object 352 EC03Q05 object 353 EC03Q06 object 354 EC03Q07 object 355 EC03Q08 object 356 EC03Q09 object 357 EC03Q10 object 358 EC04Q01A float64 359 EC04Q01B float64 360 EC04Q01C float64 361 EC04Q02A float64 362 EC04Q02B float64 363 EC04Q02C float64 364 EC04Q03A float64 365 EC04Q03B float64 366 EC04Q03C float64 367 EC04Q04A float64 368 EC04Q04B float64 369 EC04Q04C float64 370 EC04Q05A float64 371 EC04Q05B float64 372 EC04Q05C float64 373 EC04Q06A float64 374 EC04Q06B float64 375 EC04Q06C float64 376 EC05Q01 object 377 EC06Q01 object 378 EC07Q01 object 379 EC07Q02 object 380 EC07Q03 object 381 EC07Q04 object 382 EC07Q05 object 383 EC08Q01 object 384 EC08Q02 object 385 EC08Q03 object 386 EC08Q04 object 387 EC09Q03 object 388 EC10Q01 object 389 EC11Q02 object 390 EC11Q03 object 391 EC12Q01 object 392 ST22Q01 object 393 ST23Q01 object 394 ST23Q02 object 395 ST23Q03 object 396 ST23Q04 object 397 ST23Q05 object 398 ST23Q06 object 399 ST23Q07 object 400 ST23Q08 object 401 ST24Q01 object 402 ST24Q02 object 403 ST24Q03 object 404 CLCUSE1 object 405 CLCUSE301 int64 406 CLCUSE302 int64 407 DEFFORT int64 408 QUESTID object 409 BOOKID object 410 EASY object 411 AGE float64 412 GRADE float64 413 PROGN object 414 ANXMAT float64 415 ATSCHL float64 416 
ATTLNACT float64 417 BELONG float64 418 BFMJ2 float64 419 BMMJ1 float64 420 CLSMAN float64 421 COBN_F object 422 COBN_M object 423 COBN_S object 424 COGACT float64 425 CULTDIST float64 426 CULTPOS float64 427 DISCLIMA float64 428 ENTUSE float64 429 ESCS float64 430 EXAPPLM float64 431 EXPUREM float64 432 FAILMAT float64 433 FAMCON float64 434 FAMCONC float64 435 FAMSTRUC float64 436 FISCED object 437 HEDRES float64 438 HERITCUL float64 439 HISCED object 440 HISEI float64 441 HOMEPOS float64 442 HOMSCH float64 443 HOSTCUL float64 444 ICTATTNEG float64 445 ICTATTPOS float64 446 ICTHOME float64 447 ICTRES float64 448 ICTSCH float64 449 IMMIG object 450 INFOCAR float64 451 INFOJOB1 float64 452 INFOJOB2 float64 453 INSTMOT float64 454 INTMAT float64 455 ISCEDD object 456 ISCEDL object 457 ISCEDO object 458 LANGCOMM float64 459 LANGN object 460 LANGRPPD float64 461 LMINS float64 462 MATBEH float64 463 MATHEFF float64 464 MATINTFC float64 465 MATWKETH float64 466 MISCED object 467 MMINS float64 468 MTSUP float64 469 OCOD1 object 470 OCOD2 object 471 OPENPS float64 472 OUTHOURS float64 473 PARED float64 474 PERSEV float64 475 REPEAT object 476 SCMAT float64 477 SMINS float64 478 STUDREL float64 479 SUBNORM float64 480 TCHBEHFA float64 481 TCHBEHSO float64 482 TCHBEHTD float64 483 TEACHSUP float64 484 TESTLANG object 485 TIMEINT float64 486 USEMATH float64 487 USESCH float64 488 WEALTH float64 489 ANCATSCHL float64 490 ANCATTLNACT float64 491 ANCBELONG float64 492 ANCCLSMAN float64 493 ANCCOGACT float64 494 ANCINSTMOT float64 495 ANCINTMAT float64 496 ANCMATWKETH float64 497 ANCMTSUP float64 498 ANCSCMAT float64 499 ANCSTUDREL float64 500 ANCSUBNORM float64 501 PV1MATH float64 502 PV2MATH float64 503 PV3MATH float64 504 PV4MATH float64 505 PV5MATH float64 506 PV1MACC float64 507 PV2MACC float64 508 PV3MACC float64 509 PV4MACC float64 510 PV5MACC float64 511 PV1MACQ float64 512 PV2MACQ float64 513 PV3MACQ float64 514 PV4MACQ float64 515 PV5MACQ float64 516 PV1MACS float64 
517 PV2MACS float64 518 PV3MACS float64 519 PV4MACS float64 520 PV5MACS float64 521 PV1MACU float64 522 PV2MACU float64 523 PV3MACU float64 524 PV4MACU float64 525 PV5MACU float64 526 PV1MAPE float64 527 PV2MAPE float64 528 PV3MAPE float64 529 PV4MAPE float64 530 PV5MAPE float64 531 PV1MAPF float64 532 PV2MAPF float64 533 PV3MAPF float64 534 PV4MAPF float64 535 PV5MAPF float64 536 PV1MAPI float64 537 PV2MAPI float64 538 PV3MAPI float64 539 PV4MAPI float64 540 PV5MAPI float64 541 PV1READ float64 542 PV2READ float64 543 PV3READ float64 544 PV4READ float64 545 PV5READ float64 546 PV1SCIE float64 547 PV2SCIE float64 548 PV3SCIE float64 549 PV4SCIE float64 550 PV5SCIE float64 551 W_FSTUWT float64 552 W_FSTR1 float64 553 W_FSTR2 float64 554 W_FSTR3 float64 555 W_FSTR4 float64 556 W_FSTR5 float64 557 W_FSTR6 float64 558 W_FSTR7 float64 559 W_FSTR8 float64 560 W_FSTR9 float64 561 W_FSTR10 float64 562 W_FSTR11 float64 563 W_FSTR12 float64 564 W_FSTR13 float64 565 W_FSTR14 float64 566 W_FSTR15 float64 567 W_FSTR16 float64 568 W_FSTR17 float64 569 W_FSTR18 float64 570 W_FSTR19 float64 571 W_FSTR20 float64 572 W_FSTR21 float64 573 W_FSTR22 float64 574 W_FSTR23 float64 575 W_FSTR24 float64 576 W_FSTR25 float64 577 W_FSTR26 float64 578 W_FSTR27 float64 579 W_FSTR28 float64 580 W_FSTR29 float64 581 W_FSTR30 float64 582 W_FSTR31 float64 583 W_FSTR32 float64 584 W_FSTR33 float64 585 W_FSTR34 float64 586 W_FSTR35 float64 587 W_FSTR36 float64 588 W_FSTR37 float64 589 W_FSTR38 float64 590 W_FSTR39 float64 591 W_FSTR40 float64 592 W_FSTR41 float64 593 W_FSTR42 float64 594 W_FSTR43 float64 595 W_FSTR44 float64 596 W_FSTR45 float64 597 W_FSTR46 float64 598 W_FSTR47 float64 599 W_FSTR48 float64 600 W_FSTR49 float64 601 W_FSTR50 float64 602 W_FSTR51 float64 603 W_FSTR52 float64 604 W_FSTR53 float64 605 W_FSTR54 float64 606 W_FSTR55 float64 607 W_FSTR56 float64 608 W_FSTR57 float64 609 W_FSTR58 float64 610 W_FSTR59 float64 611 W_FSTR60 float64 612 W_FSTR61 float64 613 W_FSTR62 float64 614 
W_FSTR63 float64 615 W_FSTR64 float64 616 W_FSTR65 float64 617 W_FSTR66 float64 618 W_FSTR67 float64 619 W_FSTR68 float64 620 W_FSTR69 float64 621 W_FSTR70 float64 622 W_FSTR71 float64 623 W_FSTR72 float64 624 W_FSTR73 float64 625 W_FSTR74 float64 626 W_FSTR75 float64 627 W_FSTR76 float64 628 W_FSTR77 float64 629 W_FSTR78 float64 630 W_FSTR79 float64 631 W_FSTR80 float64 632 WVARSTRR int64 633 VAR_UNIT int64 634 SENWGT_STU float64 635 VER_STU object dtypes: float64(250), int64(18), object(368) memory usage: 2.3+ GB None
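The `info()` output above reports more than 2.3 GB of memory for the full frame. The following is a minimal sketch of one way to reduce that, assuming only a handful of columns are needed for the analysis: pass `usecols` to `pd.read_csv`. The tiny in-memory CSV and the `wanted` list are illustrative stand-ins, not the actual file or a recommended column selection.

```python
import io
import pandas as pd

# Tiny stand-in for pisa2012.csv (hypothetical values, real column names)
csv_text = """CNT,ST04Q01,ANXMAT,OUTHOURS,PV1MATH,W_FSTUWT
Albania,Female,0.32,10,406.8,8.9
Albania,Male,,5,486.1,8.9
"""

# Columns assumed to be needed for the analysis (an assumption, adjust as required)
wanted = ['CNT', 'ST04Q01', 'ANXMAT', 'OUTHOURS', 'PV1MATH']

# usecols makes read_csv parse only the listed columns
small = pd.read_csv(io.StringIO(csv_text), usecols=wanted)
print(small.shape)
print(small.memory_usage(deep=True).sum(), 'bytes')
```

With the real file, the same `usecols` argument would go into the existing `pd.read_csv('pisa2012.csv', ...)` call.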
print(df.shape)
print(df.dtypes.head(30))
(485490, 636)
Unnamed: 0    int64
CNT           object
SUBNATIO      int64
STRATUM       object
OECD          object
NC            object
SCHOOLID      int64
STIDSTD       int64
ST01Q01       int64
ST02Q01       float64
ST03Q01       int64
ST03Q02       int64
ST04Q01       object
ST05Q01       object
ST06Q01       float64
ST07Q01       object
ST07Q02       object
ST07Q03       object
ST08Q01       object
ST09Q01       object
ST115Q01      float64
ST11Q01       object
ST11Q02       object
ST11Q03       object
ST11Q04       object
ST11Q05       object
ST11Q06       object
ST13Q01       object
ST14Q01       object
ST14Q02       object
dtype: object
df.describe()
| | Unnamed: 0 | SUBNATIO | SCHOOLID | STIDSTD | ST01Q01 | ST02Q01 | ST03Q01 | ST03Q02 | ST06Q01 | ST115Q01 | ... | W_FSTR74 | W_FSTR75 | W_FSTR76 | W_FSTR77 | W_FSTR78 | W_FSTR79 | W_FSTR80 | WVARSTRR | VAR_UNIT | SENWGT_STU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 485490.000000 | 4.854900e+05 | 485490.000000 | 485490.000000 | 485490.000000 | 485438.000000 | 485490.000000 | 485490.000000 | 457994.000000 | 479269.000000 | ... | 485490.000000 | 485490.000000 | 485490.000000 | 485490.000000 | 485490.000000 | 485490.000000 | 485490.000000 | 485490.000000 | 485490.000000 | 485490.000000 |
| mean | 242745.500000 | 4.315457e+06 | 240.152197 | 6134.066201 | 9.813323 | 2.579260 | 6.558512 | 1996.070061 | 6.148963 | 1.265356 | ... | 50.844201 | 51.020378 | 50.943149 | 50.685275 | 51.019842 | 50.540724 | 50.721164 | 40.013920 | 1.531189 | 0.140054 |
| std | 140149.035431 | 2.524434e+06 | 278.563016 | 6733.144944 | 3.734726 | 2.694013 | 3.705244 | 0.255250 | 0.970693 | 0.578992 | ... | 120.684726 | 122.946533 | 121.170883 | 119.267686 | 122.981541 | 119.479516 | 119.799018 | 22.951264 | 0.539759 | 0.137864 |
| min | 1.000000 | 8.000000e+04 | 1.000000 | 1.000000 | 7.000000 | 1.000000 | 1.000000 | 1996.000000 | 4.000000 | 1.000000 | ... | 0.292900 | 0.292900 | 0.292900 | 0.292900 | 0.292900 | 0.292900 | 0.292900 | 1.000000 | 1.000000 | 0.000500 |
| 25% | 121373.250000 | 2.030000e+06 | 61.000000 | 1811.000000 | 9.000000 | 1.000000 | 4.000000 | 1996.000000 | 6.000000 | 1.000000 | ... | 4.660300 | 4.664800 | 4.643100 | 4.667000 | 4.675200 | 4.651850 | 4.660300 | 20.000000 | 1.000000 | 0.037800 |
| 50% | 242745.500000 | 4.100000e+06 | 136.000000 | 3740.000000 | 10.000000 | 1.000000 | 7.000000 | 1996.000000 | 6.000000 | 1.000000 | ... | 13.637700 | 13.698900 | 13.611700 | 13.672100 | 13.731100 | 13.582000 | 13.600200 | 40.000000 | 2.000000 | 0.145200 |
| 75% | 364117.750000 | 6.880000e+06 | 291.000000 | 7456.000000 | 10.000000 | 3.000000 | 9.000000 | 1996.000000 | 7.000000 | 1.000000 | ... | 41.233500 | 41.512500 | 41.695200 | 41.097300 | 41.189600 | 41.290925 | 41.356000 | 60.000000 | 2.000000 | 0.199900 |
| max | 485490.000000 | 8.580000e+06 | 1471.000000 | 33806.000000 | 96.000000 | 25.000000 | 99.000000 | 1997.000000 | 16.000000 | 4.000000 | ... | 2476.566800 | 4155.283000 | 3743.450100 | 3232.163700 | 3904.868100 | 3607.478300 | 3412.174100 | 80.000000 | 3.000000 | 5.095500 |
8 rows × 268 columns
#Filter the ST42 survey columns (the math anxiety items are among them)
pattern = re.compile(r'st42', re.IGNORECASE)
columns = df.filter(regex=pattern)
columns
| | ST42Q01 | ST42Q02 | ST42Q03 | ST42Q04 | ST42Q05 | ST42Q06 | ST42Q07 | ST42Q08 | ST42Q09 | ST42Q10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Agree | Disagree | Agree | Agree | Agree | Agree | Agree | Disagree | Disagree | Disagree |
| 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | Strongly agree | Disagree | Agree | Agree | Disagree | Strongly agree | Disagree | Agree | Agree |
| 4 | Strongly agree | Strongly agree | Agree | Strongly agree | Strongly agree | Disagree | Disagree | Disagree | Agree | Agree |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 485485 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 485486 | Agree | Disagree | Disagree | Agree | Disagree | Agree | Agree | Disagree | Agree | Disagree |
| 485487 | Agree | Disagree | Disagree | Agree | Disagree | Agree | Disagree | Agree | Disagree | Agree |
| 485488 | Disagree | Disagree | Strongly disagree | Disagree | Disagree | Agree | Agree | Disagree | Disagree | Strongly agree |
| 485489 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
485490 rows × 10 columns
df['ST04Q01'].value_counts()
Female    245064
Male      240426
Name: ST04Q01, dtype: int64
#Count the distinct answers for one survey column
#(ST46Q01 is not part of the ST42 selection above, so it is read from df directly)
df['ST46Q01'].value_counts()
Agree                148211
Strongly agree        77022
Disagree              70473
Strongly disagree     18192
Name: ST46Q01, dtype: int64
#Convert the selected ST42 items to ordered categorical types
ordinal_var_dict = {'ST42Q01': ['Strongly agree','Agree','Disagree','Strongly disagree'],
                    'ST42Q03': ['Strongly agree','Agree','Disagree','Strongly disagree'],
                    'ST42Q05': ['Strongly agree','Agree','Disagree','Strongly disagree'],
                    'ST42Q08': ['Strongly agree','Agree','Disagree','Strongly disagree'],
                    'ST42Q10': ['Strongly agree','Agree','Disagree','Strongly disagree']}
for var in ordinal_var_dict:
    ordered_var = pd.api.types.CategoricalDtype(ordered = True,
                                                categories = ordinal_var_dict[var])
    df[var] = df[var].astype(ordered_var)
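To illustrate what the ordered categorical type provides, here is a small sketch on toy answers (not taken from the dataset). It uses the same level order as `ordinal_var_dict`; the names `likert`, `answers`, `codes` and `at_least_agree` are made up for this example.

```python
import pandas as pd

# Same level order as in ordinal_var_dict
levels = ['Strongly agree', 'Agree', 'Disagree', 'Strongly disagree']
likert = pd.api.types.CategoricalDtype(categories=levels, ordered=True)

# Toy answers, not taken from the dataset; None stands for a missing answer
answers = pd.Series(['Agree', 'Strongly disagree', None, 'Strongly agree']).astype(likert)

# Integer position of each answer in the level order; missing answers become -1
codes = answers.cat.codes
print(codes.tolist())

# Ordered categoricals support comparisons against one of the levels
at_least_agree = answers <= 'Agree'
print(at_least_agree.tolist())
```

Because the levels run from 'Strongly agree' to 'Strongly disagree', a lower code means stronger agreement, and plots that respect the category order will show the levels in that sequence.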
pattern = re.compile(r'W_', re.IGNORECASE)
columns = df.filter(regex=pattern)
columns
| | W_FSTUWT | W_FSTR1 | W_FSTR2 | W_FSTR3 | W_FSTR4 | W_FSTR5 | W_FSTR6 | W_FSTR7 | W_FSTR8 | W_FSTR9 | ... | W_FSTR71 | W_FSTR72 | W_FSTR73 | W_FSTR74 | W_FSTR75 | W_FSTR76 | W_FSTR77 | W_FSTR78 | W_FSTR79 | W_FSTR80 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.9096 | 13.1249 | 13.0829 | 4.5315 | 13.0829 | 13.9235 | 13.1249 | 13.1249 | 4.3389 | 4.3313 | ... | 13.0829 | 13.9235 | 4.3389 | 4.3313 | 13.7954 | 13.9235 | 13.1249 | 13.1249 | 4.3389 | 13.0829 |
| 1 | 8.9096 | 13.1249 | 13.0829 | 4.5315 | 13.0829 | 13.9235 | 13.1249 | 13.1249 | 4.3389 | 4.3313 | ... | 13.0829 | 13.9235 | 4.3389 | 4.3313 | 13.7954 | 13.9235 | 13.1249 | 13.1249 | 4.3389 | 13.0829 |
| 2 | 8.4871 | 12.7307 | 12.7307 | 4.2436 | 12.7307 | 12.7307 | 12.7307 | 12.7307 | 4.2436 | 4.2436 | ... | 12.7307 | 12.7307 | 4.2436 | 4.2436 | 12.7307 | 12.7307 | 12.7307 | 12.7307 | 4.2436 | 12.7307 |
| 3 | 8.4871 | 12.7307 | 12.7307 | 4.2436 | 12.7307 | 12.7307 | 12.7307 | 12.7307 | 4.2436 | 4.2436 | ... | 12.7307 | 12.7307 | 4.2436 | 4.2436 | 12.7307 | 12.7307 | 12.7307 | 12.7307 | 4.2436 | 12.7307 |
| 4 | 8.4871 | 12.7307 | 12.7307 | 4.2436 | 12.7307 | 12.7307 | 12.7307 | 12.7307 | 4.2436 | 4.2436 | ... | 12.7307 | 12.7307 | 4.2436 | 4.2436 | 12.7307 | 12.7307 | 12.7307 | 12.7307 | 4.2436 | 12.7307 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 485485 | 62.4825 | 93.7238 | 31.2413 | 93.7238 | 31.2413 | 31.2413 | 93.7238 | 93.7238 | 31.2413 | 31.2413 | ... | 31.2413 | 93.7238 | 31.2413 | 93.7238 | 31.2413 | 93.7238 | 93.7238 | 93.7238 | 93.7238 | 31.2413 |
| 485486 | 65.7647 | 96.0036 | 33.9163 | 96.0036 | 33.9163 | 33.9163 | 96.0036 | 96.0036 | 33.9163 | 33.9163 | ... | 33.9163 | 96.0036 | 33.9163 | 96.0036 | 33.9163 | 96.0036 | 96.0036 | 96.0036 | 96.0036 | 33.9163 |
| 485487 | 65.7647 | 96.0036 | 33.9163 | 96.0036 | 33.9163 | 33.9163 | 96.0036 | 96.0036 | 33.9163 | 33.9163 | ... | 33.9163 | 96.0036 | 33.9163 | 96.0036 | 33.9163 | 96.0036 | 96.0036 | 96.0036 | 96.0036 | 33.9163 |
| 485488 | 65.7647 | 96.0036 | 33.9163 | 96.0036 | 33.9163 | 33.9163 | 96.0036 | 96.0036 | 33.9163 | 33.9163 | ... | 33.9163 | 96.0036 | 33.9163 | 96.0036 | 33.9163 | 96.0036 | 96.0036 | 96.0036 | 96.0036 | 33.9163 |
| 485489 | 62.4825 | 93.7238 | 31.2413 | 93.7238 | 31.2413 | 31.2413 | 93.7238 | 93.7238 | 31.2413 | 31.2413 | ... | 31.2413 | 93.7238 | 31.2413 | 93.7238 | 31.2413 | 93.7238 | 93.7238 | 93.7238 | 93.7238 | 31.2413 |
485490 rows × 81 columns
df['ANXMAT'].isna().sum()
170726
df['ANXMAT']
0 0.32
1 NaN
2 NaN
3 0.31
4 1.02
...
485485 NaN
485486 -0.20
485487 0.32
485488 -0.20
485489 NaN
Name: ANXMAT, Length: 485490, dtype: float64
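With 170726 of 485490 values missing, roughly 35% of the students have no `ANXMAT` index. A small sketch of computing that share directly, using the first five values printed above as a toy series (the names `anxmat`, `missing_share` and `anxmat_known` are illustrative):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df['ANXMAT'], using the first five values shown above
anxmat = pd.Series([0.32, np.nan, np.nan, 0.31, 1.02])

# isna() yields booleans, so the mean is the fraction of missing entries
missing_share = anxmat.isna().mean()
print(f'{missing_share:.1%} missing')

# Keep only the rows that have an anxiety index
anxmat_known = anxmat.dropna()
print(len(anxmat_known))
```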
To visualize the data and answer the following questions, a mathematics score and a total score are needed for each student. The mathematics score is calculated as the mean of the plausible values for the math section, and the total score as the mean of all plausible values.
#Determine math score
math_pattern = re.compile(r'PV\dMA', re.IGNORECASE)
math_columns = df.filter(regex=math_pattern)
math_mean = math_columns.mean(axis = 1)
print(math_columns)
print(math_mean)
PV1MATH PV2MATH PV3MATH PV4MATH PV5MATH PV1MACC PV2MACC \
0 406.8469 376.4683 344.5319 321.1637 381.9209 325.8374 324.2795
1 486.1427 464.3325 453.4273 472.9008 476.0165 325.6816 419.9330
2 533.2684 481.0796 489.6479 490.4269 533.2684 611.1622 486.5322
3 412.2215 498.6836 415.3373 466.7472 454.2842 538.4094 511.9255
4 381.9209 328.1742 403.7311 418.5309 395.1628 373.3525 293.1220
... ... ... ... ... ... ... ...
485485 477.1849 493.5426 479.5217 486.5322 494.3215 507.5635 480.3007
485486 518.9360 515.8202 505.6940 596.8297 508.8098 592.1561 491.6732
485487 475.2376 482.2480 507.9530 457.3220 508.7319 557.8050 453.4273
485488 550.9503 517.4560 529.1401 515.8981 501.0983 574.3184 554.0661
485489 470.0187 441.1980 475.4713 441.9769 443.5348 467.6819 444.3138
PV3MACC PV4MACC PV5MACC ... PV1MAPF PV2MAPF PV3MAPF \
0 279.8800 267.4170 312.5954 ... 319.6059 345.3108 360.8895
1 378.6493 359.9548 384.1019 ... 411.3647 437.8486 457.3220
2 567.5417 541.0578 544.9525 ... 580.7836 481.0796 555.0787
3 553.9882 483.8838 479.2102 ... 534.5147 455.8420 504.1362
4 364.0053 430.2150 403.7311 ... 432.5518 431.7729 399.0575
... ... ... ... ... ... ... ...
485485 556.6365 502.8899 452.2589 ... 576.1100 488.8690 456.1536
485486 542.3041 556.3250 576.5773 ... 546.9777 581.2510 560.2197
485487 546.8998 514.1845 514.1845 ... 552.3524 514.1845 479.9112
485488 593.0129 495.6457 557.1818 ... 551.7292 469.9408 608.5917
485489 462.2293 436.5244 421.7246 ... 462.2293 460.6714 454.4399
PV4MAPF PV5MAPF PV1MAPI PV2MAPI PV3MAPI PV4MAPI PV5MAPI
0 390.4892 322.7216 290.7852 345.3108 326.6163 407.6258 367.1210
1 454.2063 460.4378 434.7328 448.7537 494.7110 429.2803 434.7328
2 453.8168 491.2058 527.0369 444.4695 516.1318 403.9648 476.4060
3 454.2842 483.8838 521.2728 481.5470 503.3572 469.8629 478.4312
4 369.4579 341.4161 297.0167 353.8791 347.6476 314.1533 311.0375
... ... ... ... ... ... ... ...
485485 530.9316 416.4278 527.0369 463.1640 423.4382 515.3529 397.7333
485486 567.2301 574.2405 470.6418 472.9787 476.0944 443.3790 470.6418
485487 421.4909 493.1531 489.2585 472.1218 458.1010 440.1854 488.4795
485488 541.6031 554.0661 462.9304 428.6571 483.1827 443.4569 521.3507
485489 438.0823 408.4826 403.8090 408.4826 431.8508 394.4618 374.2094
[485490 rows x 40 columns]
0 355.183832
1 432.240230
2 512.509733
3 510.640287
4 378.980370
...
485485 482.754318
485486 527.075863
485487 489.063728
485488 526.355352
485489 424.509273
Length: 485490, dtype: float64
#Add mean math score to data frame
df['math_score'] = math_mean
df.tail()
| | Unnamed: 0 | CNT | SUBNATIO | STRATUM | OECD | NC | SCHOOLID | STIDSTD | ST01Q01 | ST02Q01 | ... | W_FSTR76 | W_FSTR77 | W_FSTR78 | W_FSTR79 | W_FSTR80 | WVARSTRR | VAR_UNIT | SENWGT_STU | VER_STU | math_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 485485 | 485486 | Vietnam | 7040000 | VNM0317 | Non-OECD | Viet Nam | 162 | 4955 | 10 | 3.0 | ... | 93.7238 | 93.7238 | 93.7238 | 93.7238 | 31.2413 | 41 | 1 | 0.0653 | 22NOV13 | 482.754318 |
| 485486 | 485487 | Vietnam | 7040000 | VNM0317 | Non-OECD | Viet Nam | 162 | 4956 | 10 | 3.0 | ... | 96.0036 | 96.0036 | 96.0036 | 96.0036 | 33.9163 | 41 | 1 | 0.0688 | 22NOV13 | 527.075863 |
| 485487 | 485488 | Vietnam | 7040000 | VNM0317 | Non-OECD | Viet Nam | 162 | 4957 | 10 | 3.0 | ... | 96.0036 | 96.0036 | 96.0036 | 96.0036 | 33.9163 | 41 | 1 | 0.0688 | 22NOV13 | 489.063728 |
| 485488 | 485489 | Vietnam | 7040000 | VNM0317 | Non-OECD | Viet Nam | 162 | 4958 | 10 | 3.0 | ... | 96.0036 | 96.0036 | 96.0036 | 96.0036 | 33.9163 | 41 | 1 | 0.0688 | 22NOV13 | 526.355352 |
| 485489 | 485490 | Vietnam | 7040000 | VNM0317 | Non-OECD | Viet Nam | 162 | 4959 | 10 | 3.0 | ... | 93.7238 | 93.7238 | 93.7238 | 93.7238 | 31.2413 | 41 | 1 | 0.0653 | 22NOV13 | 424.509273 |
5 rows × 637 columns
#Determine total score
total_pattern = re.compile(r'PV', re.IGNORECASE)
total_columns = df.filter(regex=total_pattern)
total_mean = total_columns.mean(axis = 1)
print(total_columns)
print(total_mean)
PV1MATH PV2MATH PV3MATH PV4MATH PV5MATH PV1MACC PV2MACC \
0 406.8469 376.4683 344.5319 321.1637 381.9209 325.8374 324.2795
1 486.1427 464.3325 453.4273 472.9008 476.0165 325.6816 419.9330
2 533.2684 481.0796 489.6479 490.4269 533.2684 611.1622 486.5322
3 412.2215 498.6836 415.3373 466.7472 454.2842 538.4094 511.9255
4 381.9209 328.1742 403.7311 418.5309 395.1628 373.3525 293.1220
... ... ... ... ... ... ... ...
485485 477.1849 493.5426 479.5217 486.5322 494.3215 507.5635 480.3007
485486 518.9360 515.8202 505.6940 596.8297 508.8098 592.1561 491.6732
485487 475.2376 482.2480 507.9530 457.3220 508.7319 557.8050 453.4273
485488 550.9503 517.4560 529.1401 515.8981 501.0983 574.3184 554.0661
485489 470.0187 441.1980 475.4713 441.9769 443.5348 467.6819 444.3138
PV3MACC PV4MACC PV5MACC ... PV1READ PV2READ PV3READ \
0 279.8800 267.4170 312.5954 ... 249.5762 254.3420 406.8496
1 378.6493 359.9548 384.1019 ... 406.2936 349.8975 400.7334
2 567.5417 541.0578 544.9525 ... 401.2100 404.3872 387.7067
3 553.9882 483.8838 479.2102 ... 547.3630 481.4353 461.5776
4 364.0053 430.2150 403.7311 ... 311.7707 141.7883 293.5015
... ... ... ... ... ... ... ...
485485 556.6365 502.8899 452.2589 ... 460.2272 476.1134 472.9362
485486 542.3041 556.3250 576.5773 ... 490.9325 479.7053 448.4294
485487 546.8998 514.1845 514.1845 ... 462.6239 514.7503 434.5558
485488 593.0129 495.6457 557.1818 ... 505.2873 522.1282 513.3068
485489 462.2293 436.5244 421.7246 ... 532.3506 483.1034 479.9261
PV4READ PV5READ PV1SCIE PV2SCIE PV3SCIE PV4SCIE PV5SCIE
0 175.7053 218.5981 341.7009 408.8400 348.2283 367.8105 392.9877
1 369.7553 396.7618 548.9929 471.5964 471.5964 443.6218 454.8116
2 431.3938 401.2100 499.6643 428.7952 492.2044 512.7191 499.6643
3 425.0393 471.9036 438.6796 481.5740 448.9370 474.1141 426.5573
4 272.8495 260.1405 361.5628 275.7740 372.7527 403.5248 422.1746
... ... ... ... ... ... ... ...
485485 472.1419 481.6736 559.8098 528.1052 519.7128 535.5651 538.3626
485486 565.5134 451.6372 538.7355 493.9761 493.0436 561.1153 535.0056
485487 457.8122 511.5425 536.8706 571.3726 488.3812 548.9929 563.9127
485488 528.5437 522.9301 511.0407 532.4879 524.0955 551.1376 514.7706
485489 459.2741 488.6635 530.6229 473.7411 477.4711 477.4711 505.4457
[485490 rows x 50 columns]
0 347.439838
1 432.073398
2 499.186886
3 501.655846
4 365.501084
...
485485 487.096410
485486 522.822568
485487 493.067276
485488 525.598850
485489 437.768810
Length: 485490, dtype: float64
#Add total score to data frame
df['total_score'] = total_mean
df.tail()
| | Unnamed: 0 | CNT | SUBNATIO | STRATUM | OECD | NC | SCHOOLID | STIDSTD | ST01Q01 | ST02Q01 | ... | W_FSTR77 | W_FSTR78 | W_FSTR79 | W_FSTR80 | WVARSTRR | VAR_UNIT | SENWGT_STU | VER_STU | math_score | total_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 485485 | 485486 | Vietnam | 7040000 | VNM0317 | Non-OECD | Viet Nam | 162 | 4955 | 10 | 3.0 | ... | 93.7238 | 93.7238 | 93.7238 | 31.2413 | 41 | 1 | 0.0653 | 22NOV13 | 482.754318 | 487.096410 |
| 485486 | 485487 | Vietnam | 7040000 | VNM0317 | Non-OECD | Viet Nam | 162 | 4956 | 10 | 3.0 | ... | 96.0036 | 96.0036 | 96.0036 | 33.9163 | 41 | 1 | 0.0688 | 22NOV13 | 527.075863 | 522.822568 |
| 485487 | 485488 | Vietnam | 7040000 | VNM0317 | Non-OECD | Viet Nam | 162 | 4957 | 10 | 3.0 | ... | 96.0036 | 96.0036 | 96.0036 | 33.9163 | 41 | 1 | 0.0688 | 22NOV13 | 489.063728 | 493.067276 |
| 485488 | 485489 | Vietnam | 7040000 | VNM0317 | Non-OECD | Viet Nam | 162 | 4958 | 10 | 3.0 | ... | 96.0036 | 96.0036 | 96.0036 | 33.9163 | 41 | 1 | 0.0688 | 22NOV13 | 526.355352 | 525.598850 |
| 485489 | 485490 | Vietnam | 7040000 | VNM0317 | Non-OECD | Viet Nam | 162 | 4959 | 10 | 3.0 | ... | 93.7238 | 93.7238 | 93.7238 | 31.2413 | 41 | 1 | 0.0653 | 22NOV13 | 424.509273 | 437.768810 |
5 rows × 638 columns
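Note that the pattern `PV\dMA` selects 40 columns: the five `PVnMATH` columns plus columns such as `PVnMACC` and `PVnMAPF`. The total score above therefore averages 50 columns, 40 of which are math-related. The following sketch shows one possible alternative, not the approach used in this notebook: average only the overall plausible values per domain and give each domain equal weight. The helper `domain_mean` and the toy numbers are hypothetical.

```python
import pandas as pd

# Toy frame with two plausible values per domain plus one extra math column
# (hypothetical numbers; the real data has five values per domain)
toy = pd.DataFrame({
    'PV1MATH': [400.0, 500.0], 'PV2MATH': [420.0, 520.0],
    'PV1MACC': [300.0, 600.0],
    'PV1READ': [450.0, 480.0], 'PV2READ': [470.0, 500.0],
    'PV1SCIE': [430.0, 510.0], 'PV2SCIE': [410.0, 530.0],
})

def domain_mean(frame, domain):
    """Row-wise mean of the columns named PV<n><domain> (hypothetical helper)."""
    cols = frame.filter(regex=rf'^PV\d+{domain}$')
    return cols.mean(axis=1)

math_only = domain_mean(toy, 'MATH')   # PV1MACC is not matched
read_only = domain_mean(toy, 'READ')
scie_only = domain_mean(toy, 'SCIE')

# Each domain contributes equally to the combined score
balanced_total = (math_only + read_only + scie_only) / 3
print(math_only.tolist())
print(balanced_total.tolist())
```

Whether the additional math columns should count toward the math score is a modelling decision; the sketch only makes the difference between the two choices explicit.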
The dataset contains 485490 student records with 636 features. Most of the features are answers to survey questions and are ordered factor variables (Strongly agree, Agree, Disagree, Strongly disagree) or numeric scales. The numeric variables include the resulting scores in the different test domains.
The main focus of this analysis is the influence of math anxiety and out-of-school learning hours on the PISA math score, both overall and by gender.
I expect that students with high math anxiety score lower in math. I also expect that students who put a lot of effort into learning outside of school score higher in total.
I start by inspecting the distributions of the main variables of interest, the math score and the total score.
#Math score
plt.figure(figsize=[14, 6])
bins = np.arange(0, df['math_score'].max()+100, 100)
plt.hist(data = df, x = 'math_score', bins=bins)
plt.xlabel('Math score')
plt.title('Math score distribution')
plt.show()
The math score is almost normally distributed.
#Total score
plt.figure(figsize=[14, 6])
bins = np.arange(0, df['total_score'].max()+100, 100)
plt.hist(data = df, x = 'total_score', bins=bins)
plt.xlabel('Total score')
plt.title('Total score distribution')
plt.show()
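Bins that are 100 points wide leave only a handful of bars, which makes it hard to judge the shape of the distribution. As a sketch, the helper below builds narrower, evenly aligned bin edges. It runs on synthetic scores rather than the PISA data, and both `bin_edges` and the width of 25 are arbitrary choices made for illustration.

```python
import numpy as np

# Synthetic scores for illustration only (not the PISA data)
rng = np.random.default_rng(0)
scores = rng.normal(loc=470, scale=100, size=10_000)

def bin_edges(values, width):
    """Edges aligned to multiples of `width` that cover all values (hypothetical helper)."""
    start = np.floor(values.min() / width) * width
    stop = np.ceil(values.max() / width) * width
    return np.arange(start, stop + width, width)

edges = bin_edges(scores, 25)          # 25-point bins instead of 100
counts, _ = np.histogram(scores, bins=edges)
print(len(edges) - 1, 'bins; fullest bin holds', counts.max(), 'scores')
```

The resulting `edges` array can be passed as `bins=edges` to `plt.hist` in the cells above.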
The total score is also almost normally distributed. Next I'll look at the distributions of the other variables relevant to my analysis, before examining their relationship with the total score and the math score in the next step:
#Visualize math anxiety score
plt.figure(figsize=[14, 6])
bins = np.arange(df['ANXMAT'].min()-0.5, df['ANXMAT'].max()+0.5, 0.5)
plt.hist(data = df, x = 'ANXMAT', bins=bins, edgecolor = 'black')
plt.xlabel('Anxiety')
plt.title('Math anxiety score distribution')
plt.show()
#Visualize with smaller bin size
plt.figure(figsize=[14,6])
bins = np.arange(df['ANXMAT'].min()-0.1, df['ANXMAT'].max()+0.1, 0.1)
plt.hist(data = df, x = 'ANXMAT', bins=bins)
plt.xlabel('Anxiety')
plt.title('Math anxiety score distribution')
plt.show()
The anxiety score is also normally distributed.
#Visualize learning hours outside of school on standard scale
plt.figure(figsize=[14, 6])
bins = np.arange(0, df['OUTHOURS'].max()+1, 1)
plt.hist(data = df, x = 'OUTHOURS', bins=bins)
plt.xlim([0,80])
plt.xlabel('Learning hours outside of school')
plt.title('Out-of-school learning hours distribution')
plt.show()
#Determine the possible limit
df.query('OUTHOURS>=80')['OUTHOURS'].sum()
42810.0
df.query('OUTHOURS<80')['OUTHOURS'].sum()
3386125.0
#Learning hours outside of school on log scale
plt.figure(figsize=[14, 6])
bins = 10 ** np.arange(0, np.log10(df['OUTHOURS'].max())+0.25, 0.25)
plt.hist(data = df, x = 'OUTHOURS', bins=bins)
plt.xscale('log')
plt.xlabel('Learning hours outside of school on log scale')
plt.title('Out-of-school learning hours distribution with log scale')
plt.show()
The out-of-school learning hours have a long-tailed distribution, with many students at the lower end and few students learning many hours outside of school. When plotted on a log scale, the distribution of learning hours looks normally distributed. Next I will investigate the specific survey questions related to anxiety.
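The log-scale observation above can also be checked numerically by comparing the skewness of the raw and the log-transformed values. The following is a minimal sketch on synthetic, log-normally distributed data standing in for `OUTHOURS`; the generated values are an assumption for illustration and are not taken from the PISA file. Values of 0 cannot be placed on a log axis, so the sketch keeps only positive hours.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df['OUTHOURS'] (assumption: right-skewed, strictly positive)
rng = np.random.default_rng(0)
hours = pd.Series(rng.lognormal(mean=2.0, sigma=0.8, size=5000), name='OUTHOURS')

# log10 is undefined for 0, so restrict to positive values first
positive = hours[hours > 0]

raw_skew = positive.skew()            # clearly positive for a long right tail
log_skew = np.log10(positive).skew()  # close to 0 if the log values are symmetric

print(f'skew raw: {raw_skew:.2f}, skew log10: {log_skew:.2f}')
```

On the real data the same two lines could be applied to `df['OUTHOURS']`; a skewness near 0 after the transform would support the visual impression.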
fig, ax = plt.subplots(ncols=2, nrows=3, figsize = [12,12])
default_color = sb.color_palette()[0]
sb.countplot(data = df, x = 'ST42Q01', color = default_color, ax=ax[0,0])
ax[0,0].set_xlabel('Math Anxiety - Worry That It Will Be Difficult')
sb.countplot(data = df, x = 'ST42Q03', color = default_color, ax=ax[0,1])
ax[0,1].set_xlabel('Math Anxiety - Get Very Tense')
sb.countplot(data = df, x = 'ST42Q05', color = default_color, ax=ax[1,0])
ax[1,0].set_xlabel('Math Anxiety - Get Very Nervous')
sb.countplot(data = df, x = 'ST42Q08', color = default_color, ax=ax[1,1])
ax[1,1].set_xlabel('Math Anxiety - Feel Helpless')
sb.countplot(data = df, x = 'ST42Q10', color = default_color, ax=ax[2,0])
ax[2,0].set_xlabel('Math Anxiety - Worry About Getting Poor <Grades>')
fig.suptitle('Math anxiety distribution per question')
plt.show()
Students tend to agree or strongly agree with the statements Worry that it will be difficult and Worry about getting poor grades. They tend to disagree with the statements about feeling very nervous, feeling helpless, and getting very tense.
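Reading these count plots is easier when the answer levels appear in their natural order. A minimal sketch of one way to do that is to convert the answer columns to an ordered categorical type before plotting. The toy frame and the exact label spellings below are assumptions; the labels should be checked against the values actually present in `df`.

```python
import pandas as pd

# Assumed answer labels, ordered from most to least agreement
answer_order = ['Strongly agree', 'Agree', 'Disagree', 'Strongly disagree']
answer_type = pd.CategoricalDtype(categories=answer_order, ordered=True)

# Toy stand-in for one survey column of df (values invented for illustration)
toy = pd.DataFrame({'ST42Q01': ['Disagree', 'Agree', 'Strongly agree', 'Agree', None]})
toy['ST42Q01'] = toy['ST42Q01'].astype(answer_type)

# With sort=False the counts follow the category order, not the frequency
counts = toy['ST42Q01'].value_counts(sort=False)
print(counts)
```

Once a column has this dtype, seaborn's categorical plots draw the levels in the category order, so no explicit `order` argument is needed.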
#Gender
plt.figure(figsize=[14, 8])
default_color = sb.color_palette()[0]
sb.countplot(data = df, x = 'ST04Q01', color = default_color)
plt.title('Gender distribution')
plt.show()
Slightly more female than male students participated in the study.
The first variables I looked into in my exploration (total score, math score and anxiety score) are normally distributed. The out-of-school learning hours have a long-tailed, right-skewed distribution. On a log scale this distribution looks normally distributed as well.
I calculated the math score and total score per student by taking the mean of the relevant values for each student, and added the results to the data frame as the additional columns math_score and total_score.
I start the bivariate exploration by looking at the relation between the math score and the anxiety score.
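The averaging step described above could look roughly like the sketch below. It runs on a toy frame and assumes plausible-value columns named in the PISA style (`PV1MATH`, `PV1READ`, `PV1SCIE`, and so on). The exact columns that went into `math_score` and `total_score` are not shown in this section, so the column selection and the definition of the total as the mean of the three domain means are assumptions, and `domain_mean` is a hypothetical helper.

```python
import pandas as pd

# Toy frame standing in for df; column names and values are assumed for illustration
toy = pd.DataFrame({
    'PV1MATH': [400.0, 500.0], 'PV2MATH': [420.0, 520.0],
    'PV1READ': [450.0, 480.0], 'PV2READ': [470.0, 500.0],
    'PV1SCIE': [430.0, 510.0], 'PV2SCIE': [410.0, 530.0],
})

def domain_mean(frame, suffix):
    """Row-wise mean over all columns named PV<k><suffix> (hypothetical helper)."""
    cols = [c for c in frame.columns if c.startswith('PV') and c.endswith(suffix)]
    return frame[cols].mean(axis=1)

toy['math_score'] = domain_mean(toy, 'MATH')
toy['reading_score'] = domain_mean(toy, 'READ')
toy['science_score'] = domain_mean(toy, 'SCIE')

# Assumption: the total score is the mean of the three domain scores
toy['total_score'] = toy[['math_score', 'reading_score', 'science_score']].mean(axis=1)
print(toy[['math_score', 'total_score']])
```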
categoric_vars = ['ST42Q01', 'ST42Q03', 'ST42Q05', 'ST42Q08', 'ST42Q10']
#Math score vs. anxiety score
plt.figure(figsize = [14, 8])
plt.scatter(data = df, x = 'ANXMAT', y = 'math_score', s = 1)
plt.xlabel('Anxiety score')
plt.ylabel('Math score')
plt.title('Math score vs. Anxiety score')
plt.show()
The visualization shows only a slight correlation between math anxiety and the math score. Students who feel more anxious about math score slightly lower in both the math and the total score.
#Math score vs. gender
plt.figure(figsize = [14, 8])
sb.boxplot(data = df, x = 'ST04Q01', y = 'math_score',color = default_color)
plt.ylabel('Math score')
plt.xlabel('Gender')
plt.title('Math score per gender')
plt.show()
Male students score slightly better in math than female students.
#Math anxiety vs. gender
plt.figure(figsize = [14, 8])
sb.violinplot(data = df, x = 'ST04Q01', y = 'ANXMAT',color = default_color)
plt.ylabel('Math anxiety score')
plt.xlabel('Gender')
plt.title('Math anxiety per gender')
plt.show()
Female students feel more anxious about math than male students.
#Math score vs. learning hours outside of school
plt.figure(figsize = [14, 8])
sb.stripplot(data = df, x = 'OUTHOURS', y = 'math_score', jitter=0.3, size = 1)
plt.xlim([0, 80])
plt.xlabel('Learning hours outside of school')
plt.xticks([0, 20, 40, 60, 80])
plt.ylabel('Math score')
plt.title('Math score vs. Out-of-school learning hours')
plt.show()
#Math score vs. learning hours outside of school with log scale
plt.figure(figsize = [14, 8])
sb.stripplot(data = df, x = 'OUTHOURS', y = 'math_score', jitter=0.3, size = 1)
plt.xlabel('Learning hours outside of school')
plt.xscale('log')
plt.ylabel('Math score')
plt.title('Math score vs. Out-of-school learning hours with log scale')
plt.show()
The above visualizations do not show a strong correlation between more learning hours outside of school and better math scores. Next I will look at the total score vs. the out-of-school learning hours.
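Because a strip plot of several hundred thousand points overplots heavily, one way to cross-check the reading above is to compare the mean math score per band of learning hours. This is a minimal sketch on synthetic data in which the two variables are generated independently; the band edges are arbitrary and the values are invented, so it illustrates the technique rather than the PISA result.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df: hours and score are drawn independently (assumption)
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    'OUTHOURS': rng.integers(0, 40, size=2000),
    'math_score': rng.normal(470, 90, size=2000),
})

# Arbitrary, left-closed bands: [0, 5), [5, 10), [10, 20), [20, 40)
edges = [0, 5, 10, 20, 40]
toy['hours_band'] = pd.cut(toy['OUTHOURS'], bins=edges, right=False)

# Count and mean score per band; similar means across bands suggest a weak relation
band_summary = (toy.groupby('hours_band', observed=True)['math_score']
                   .agg(['count', 'mean']))
print(band_summary)
```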
#Total score vs. learning hours outside of school
plt.figure(figsize = [14, 8])
sb.stripplot(data = df, x = 'OUTHOURS', y = 'total_score', size = 1, jitter=0.3)
plt.xlim([0, 80])
plt.xlabel('Learning hours outside of school')
#plt.xscale('log')
plt.xticks([0, 20, 40, 60, 80])
plt.ylabel('Total score')
plt.title('Total score vs. Out-of-school learning hours')
plt.show()
#Total score vs. learning hours outside of school with log scale
plt.figure(figsize = [14, 8])
sb.stripplot(data = df, x = 'OUTHOURS', y = 'total_score', size = 1, jitter=0.3)
plt.xlabel('Learning hours outside of school')
plt.xscale('log')
plt.ylabel('Total score')
plt.title('Total score vs. Out-of-school learning hours with log scale')
plt.show()
The same applies to the total score as to the math score: students do not score better the more time they spend learning outside of school. Next I will take a closer look at the relation between math score and total score, since the visualizations above look almost the same for both.
#Visualize Math score vs. Total score
plt.figure(figsize = [14, 8])
plt.scatter(data = df, x = 'math_score', y = 'total_score', s = 1)
plt.xlabel('Math score')
plt.ylabel('Total score')
plt.title('Math score vs. Total score')
plt.show()
The relationship between math and total score is linear. Students score in the same range in math as they score in total. The last visualization in the bivariate section looks at the math anxiety survey questions in relation to the math score.
#Visualize math anxiety questions vs. math score
fig, ax = plt.subplots(nrows = 5, figsize = [14,8])
for i in range(len(categoric_vars)):
    var = categoric_vars[i]
    sb.violinplot(data = df, x = var, y = 'math_score', ax = ax[i],
                  color = default_color)
plt.tight_layout()
fig.suptitle('Math score per math anxiety question'.title(), y=1.02)
plt.show();
The visualizations show a strong relation between anxiety and the math score: the more students disagree with the statements, the better they score in math.
The more anxious students feel, the lower their math score. There is also a correlation between anxiety and gender: female students feel more anxious about math than male students.
Math score and total score are linearly related. Students score at the same level in math and in total. Students who learn more hours outside of school do not score better in math or in total, so there is no strong correlation between out-of-school learning hours and the scores.
I am starting the multivariate exploration with a plot matrix of math score, total score, anxiety score and out-of-school learning hours.
#Plot matrices of the numeric variables Math score, Total score, Anxiety score and Learning hours outside of school
numeric_var = ['math_score', 'total_score', 'ANXMAT', 'OUTHOURS']
g = sb.PairGrid(data = df, vars = numeric_var)
g.fig.set_size_inches(14, 8);
g.map_diag(plt.hist)
g.map_offdiag(plt.scatter)
plt.suptitle('Matrix of Math score, Total score, Math Anxiety score, Out-of-school learning hours'.title(), y=1.02);
There are no new findings beyond those from the bivariate exploration. Next I will look at the relation between total score and math score for two of the survey questions related to math anxiety.
def hist2dgrid(x, y, **kwargs):
    """ Quick hack for creating heat maps with seaborn's PairGrid. """
    palette = kwargs.pop('color')
    bins_x = np.arange(df['math_score'].min()//100, df['math_score'].max()+100, 100)
    bins_y = np.arange(df['total_score'].min()//100, df['total_score'].max()+100, 100)
    plt.hist2d(x, y, cmap = palette, bins=[bins_x, bins_y], cmin = 0.5)
#Visualization for question `Worry That It Will Be Difficult`
g = sb.FacetGrid(data = df, col = 'ST42Q01', col_wrap = 3, height = 3)
g.fig.set_size_inches(14, 8);
g.map(hist2dgrid, 'math_score', 'total_score', color = 'viridis_r')
g.set_xlabels('Math score')
g.set_ylabels('Total score')
plt.suptitle('Distribution of math and total score by math anxiety - Worry That It Will Be Difficult'.title(), y = 1.02)
plt.show()
#Visualization for question `Get very tense`
g = sb.FacetGrid(data = df, col = 'ST42Q03', col_wrap = 3, height = 3)
g.fig.set_size_inches(14, 8);
g.map(hist2dgrid, 'math_score', 'total_score', color = 'viridis_r')
g.set_xlabels('Math score')
g.set_ylabels('Total score')
plt.suptitle('Distribution of math and total score by math anxiety - Get very tense'.title(), y = 1.02)
plt.show()
The more students agree with the statements related to math anxiety, the more likely they are to score lower in math and in total. Next up is the relation between math score, total score and anxiety score.
#Anxiety in relation to math and total score
fig, ax = plt.subplots(nrows = 1, ncols = 1, figsize = [14, 8]);
plt.scatter(data = df, x = 'math_score', y = 'total_score', c = 'ANXMAT', s= 1, cmap = 'viridis_r')
plt.colorbar(label ='Math anxiety score')
plt.xlabel('Math score')
plt.ylabel('Total score')
plt.title('Math score vs. Total score vs. Math anxiety score');
There is a trend towards higher scores the less anxious students feel, but it is not that clear from this type of visualization alone. The next exploration is the relation between math score, out-of-school learning hours and the math anxiety score.
fig, ax = plt.subplots(nrows = 1, ncols = 1, figsize = [14, 8]);
plt.scatter(data = df, x = 'OUTHOURS', y = 'math_score', c = 'ANXMAT', s= 1,
cmap = 'viridis_r')
plt.colorbar(label ='Math anxiety score')
plt.xlabel('Out-of-school learning hours')
plt.ylabel('Math score')
plt.title('Math score vs. Out-of-school learning hours vs. Math anxiety score');
This visualization shows a tendency for students who feel less anxious about math to score higher in math. But even students who spent many hours learning outside of school did not necessarily score higher in math. That is why I look into the correlation coefficients in the next visualization.
#Correlation heatmap for numeric variables Math score, Total score, Anxiety score and Learning hours outside of school
#Create a new figure for the heatmap (fig from the previous cell refers to the scatter plot)
plt.figure(figsize = [14, 8])
sb.heatmap(df[numeric_var].corr(), cmap = 'viridis_r', annot = True,
fmt = '.2f', vmin = 0)
plt.suptitle('Correlation heatmap for Math score, Total score, Math anxiety score and Out-of-school learning hours'.title(), y = 1.02);
There is a strong linear correlation between the total score and the math score. Learning hours outside of school show no correlation with the math score, the total score, or the math anxiety score. Math anxiety has a negative correlation of -0.36 with the total score and -0.38 with the math score. As a result I will not look further into the variable learning hours outside of school. Next I will explore the influence of gender on the math score and the anxiety score.
#Math anxiety score vs. math score per gender
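The coefficients in the heatmap come from `DataFrame.corr()`, which computes Pearson correlations by default and ignores missing values. The sketch below shows the same call on synthetic data built to have a moderate negative relation; the slope and noise level are invented for illustration and are not estimates from the PISA data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: score decreases with anxiety, plus noise (invented parameters)
rng = np.random.default_rng(42)
n = 5000
anxiety = rng.normal(0.0, 1.0, size=n)
toy = pd.DataFrame({
    'ANXMAT': anxiety,
    'math_score': 470.0 - 35.0 * anxiety + rng.normal(0.0, 85.0, size=n),
})

# Pearson correlation matrix (default method of DataFrame.corr)
corr = toy.corr()
r = corr.loc['ANXMAT', 'math_score']
print(round(r, 2))
```

One design note on the heatmap itself: when negative coefficients are present, a colour scale that starts at 0 maps all of them to the same colour. Setting `vmin = -1` and `vmax = 1` with a diverging colormap is a common alternative.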
g = sb.FacetGrid(data = df, col = 'ST04Q01')
g.fig.set_size_inches(14, 8);
g.map(plt.scatter, 'ANXMAT','math_score', s=1)
g.set_xlabels('Anxiety score')
g.set_ylabels('Math score')
plt.suptitle('Math score vs. Anxiety score per gender'.title(), y = 1.02);
plt.show()
Here too, female and male students show the same pattern: the more anxious they feel about math, the lower they score in math. However, this visualization is not well suited to showing the difference between genders.
# Math score in relation to the question Math Anxiety - Worry That It Will Be Difficult and gender
fig = plt.figure(figsize = [14,6])
ax = sb.stripplot(data = df, x = 'ST42Q01', y = 'math_score', hue = 'ST04Q01',
palette = 'Blues', size = 1, jitter = 0.3, dodge = True)
plt.title('Math score in relation to the question Math Anxiety - Worry That It Will Be Difficult and gender')
plt.legend(title ='Gender')
plt.ylabel('Math score')
plt.xlabel('Math Anxiety - Worry That It Will Be Difficult')
plt.show();
This visualization shows the answers to the question Worry That It Will Be Difficult in relation to the math score. The more students disagree with the survey question, the higher they score in math. There is also a difference between genders: male students score higher than female students for every answer except Strongly disagree, where female students have a higher math score than male students.
# Math score in relation to the question Math Anxiety - Get very tense and gender
fig = plt.figure(figsize = [14,6])
ax = sb.stripplot(data = df, x = 'ST42Q03', y = 'math_score', hue = 'ST04Q01',
palette = 'Blues', size = 1, jitter = 0.3, dodge = True)
plt.title('Math score in relation to the question Math Anxiety - Get very tense and gender')
plt.legend(title ='Gender')
plt.ylabel('Math score')
plt.xlabel('Math Anxiety - Get very tense')
plt.show();
This visualization shows the answers to the question Get very tense in relation to the math score. The more students disagree with the survey question, the higher they score in math. There is also a difference between genders: male students score higher than female students for each answer.
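Differences between groups are hard to judge by eye in a dense strip plot, so the per-answer comparison can be cross-checked with a grouped mean. This is a minimal sketch on a toy frame; the scores are invented, and the gender labels `Female` and `Male` are assumed to match the values stored in `ST04Q01`.

```python
import pandas as pd

# Toy stand-in for df with invented scores
toy = pd.DataFrame({
    'ST42Q03': ['Agree', 'Agree', 'Disagree', 'Disagree', 'Agree', 'Disagree'],
    'ST04Q01': ['Female', 'Male', 'Female', 'Male', 'Female', 'Male'],
    'math_score': [440.0, 460.0, 500.0, 520.0, 450.0, 530.0],
})

# Mean math score per answer (rows) and gender (columns)
mean_by_group = (toy.groupby(['ST42Q03', 'ST04Q01'])['math_score']
                    .mean()
                    .unstack('ST04Q01'))

# Positive values mean male students score higher for that answer
mean_by_group['male_minus_female'] = mean_by_group['Male'] - mean_by_group['Female']
print(mean_by_group)
```

Applied to `df`, the sign of the difference column would show directly for which answers one gender scores higher on average.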
Female students score lower in math and they also feel more anxious about math. Male students score higher in math and they feel less anxious about math. Out-of-school learning hours do not have a strong influence on the math or total score.
Surprising to me was that the math score and total score have such a strong correlation and that math anxiety has an influence on the total score as well. I expected that some students would score very well in math but lower in the other categories, leading to a lower total score.
There is a strong relation between feeling anxious about math and scoring low in math or in total. The more students disagree with the survey questions about math anxiety, the better they score in math. Female students feel more anxious about math than male students. Surprisingly, out-of-school learning hours do not have an influence on the math or total score. Some students score at a high level with many out-of-school learning hours, but other students learn many hours outside of school and still do not score at a high level.